Fast Community Identification by Hierarchical Growth
A new method for community identification is proposed which is founded on the analysis of
successive neighborhoods, reached through hierarchical growth from a starting vertex, and on the
definition of a community as a subgraph whose number of inner connections is larger than its
number of outer connections. In order to determine the precision and speed of the method, it is
compared with one of the most popular community identification approaches, namely Girvan and
Newman's algorithm. Although the hierarchical growth method is not as precise as Girvan and
Newman's method, it is potentially faster than most community finding algorithms.



I. INTRODUCTION 

Lying at the intersection between graph theory and 
statistical mechanics, complex networks exhibit great 
generality, which has allowed applications to many areas 
such as modeling of biological systems [1], social
interactions and information networks, to cite
just a few.

As this research area comes of age, a large toolkit,
covered in several surveys, is now available to characterize
and model complex networks. An important problem which
has been the subject of great interest recently concerns the
identification of modules of densely connected vertices in
networks, the so-called communities. These structures result
from interactions between the network components, defining
structural connecting patterns in social networks, metabolic
networks, as well as the worldwide air transportation network.

Despite the intense efforts dedicated to community
finding, no consensus has been reached on how to define
communities. Radicchi et al. [17] suggested the
following two definitions. In a strong sense, a subgraph
is a community if all of its vertices are more intensely
connected to one another than with the rest of the network.
In a weak sense, a subgraph corresponds to a community
whenever the number of edges inside the subgraph
is larger than the number of connections established with
the remainder of the network.
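
The two definitions can be checked directly on a small graph. The following is only a minimal sketch: the function names, the adjacency-set representation and the toy graph are assumptions made for illustration, and the weak test follows the wording above literally (edge counts). Some formulations compare summed vertex degrees instead, which counts every inner edge twice.

```python
from itertools import combinations

def is_strong_community(adj, S):
    """Strong sense: every vertex of S has more neighbors inside S than outside."""
    return all(len(adj[v] & S) > len(adj[v] - S) for v in S)

def is_weak_community(adj, S):
    """Weak sense, read literally: edges inside S outnumber edges leaving S."""
    inner = sum(len(adj[v] & S) for v in S) // 2  # each inner edge is seen from both ends
    outer = sum(len(adj[v] - S) for v in S)
    return inner > outer

# Hypothetical toy graph: two 4-cliques joined by the single edge (3, 4).
edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2)) + [(3, 4)]
adj = {v: set() for v in range(8)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

clique = {0, 1, 2, 3}   # one of the two cliques
bridge = {3, 4}         # the two endpoints of the connecting edge
```

In this toy graph the clique satisfies both definitions, while the pair of bridge endpoints satisfies neither.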

Over the last few years, many methods have been proposed
for community identification based on a variety of
distinct approaches such as: (i) link removal, as used
by Girvan and Newman [18] and Radicchi et al. [17];
(ii) spectral graph partitioning; (iii) agglomerative
methods, including hierarchical clustering [20, 21]; (iv)
maximization of the modularity, as in Newman and
Duch and Arenas [22]; and (v) consideration of successive
neighborhoods through hierarchical growth emanating
from hubs [23, 24]. A good survey of community
identification methods has been provided by Newman
[25] and Danon et al. This subject has also been
partially addressed in the surveys by Costa et al. and
Boccaletti et al.

Arguably, the most popular method for community
identification is that proposed by Girvan and Newman [18].
This approach considers that the edges interconnecting
communities correspond to bottlenecks between
the communities, so that the removal of such edges
tends to partition the network into communities. The
bottleneck edges are identified in terms of a measurement
called edge betweenness, which is given by the number of
shortest paths between pairs of vertices that run along
the edge. This algorithm has proven effective
for obtaining communities in several types of networks.
However, its effectiveness comes at a computational cost of
order $O(n^2 m)$ in a network with $m$ edges and $n$ vertices.
An alternative algorithm to calculate betweenness
centrality, based on random walks, has been proposed [26]
which, although conceptually interesting, is also
computationally demanding.

The method described in the present article
tends to run faster than Girvan and Newman's algorithm
while offering reasonable, though lower, precision for
identification of communities. It is based on the
consideration of successive neighborhoods of a set of seeds,
implemented through hierarchical growth. Starting from
a vertex (seed), the links of its successive neighborhoods
are analyzed in order to verify whether they belong to the same
community as the seed. This process starts from each
vertex in the network and, at each step, inter-community
edges are removed, splitting the network into communities.

A related approach was previously proposed by
Costa [23], who developed a method based on flooding
the network with wavefronts of labels emanating
simultaneously from hubs. The expanding region of each
label was implemented in terms of hierarchical growth
from the starting hubs, and the communities are found
when the wavefronts of labels touch each other.
Competitions along the propagating neighborhoods are
decided by considering an additional criterion involving the
mode of the labels at the border of the neighborhood
and the number of emanating connections. The possibility
of detecting communities by using expanding neighborhoods
has also been addressed by Bagrow and Bollt [24],
who proposed an algorithm based on the growth of an
$l$-shell starting from a vertex $v_0$, with the process stopping
whenever the rate of expansion is found to fall below an
arbitrary threshold. The $l$-shell is composed of the set of
vertices placed at distance $l$ from the vertex $v_0$, which is
analogous to the concept of ring defined by Costa
in order to introduce hierarchical measurements. At each
expansion, the total emerging degree of a shell of depth
$l$ is calculated as the sum of the emerging degrees of the
vertices at distance $l$ from $v_0$, where the emerging degree
of a vertex $i$ is the degree of $i$ minus the number of links
that connect $i$ with vertices inside the shell (analogous to
the concept of hierarchical degree introduced by Costa). When
the ratio between the total emerging degree at distance
$l$ and at distance $l - 1$ is smaller than a given threshold,
the set of vertices inside the $l$-shell is classified as a community.
Despite its simplicity, the determination of the local
community is accurate only when the vertex $v_0$ is equidistant
from all parts of its enclosing community [24]. In order
to overcome this limitation, Bagrow and Bollt suggested
starting from each vertex and then finding a consensus
partitioning of the network using a membership matrix. Such
an approach makes the algorithm more precise. On the
other hand, it is slow because it requires sorting the
membership matrix, which is of order $O(n^3)$.
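
The single-seed shell growth described above can be sketched as follows. This is an illustrative reading of the description given here, not Bagrow and Bollt's reference implementation: the function name, the adjacency-set representation and the toy graph are assumptions, "vertices inside the shell" is taken to mean all vertices at depth at most $l$, and the consensus step based on the membership matrix is not covered.

```python
from itertools import combinations

def l_shell_community(adj, start, threshold):
    """Grow shells around `start` until the total emerging degree stops expanding."""
    covered = {start}           # vertices at depth <= l
    shell = {start}             # vertices at depth exactly l
    k_prev = len(adj[start])    # total emerging degree at depth 0
    while True:
        shell = {w for v in shell for w in adj[v]} - covered
        if not shell:           # nothing left to reach
            return covered
        covered |= shell
        # Emerging degree of a vertex: its links pointing outside the covered region.
        k_curr = sum(len(adj[v] - covered) for v in shell)
        if k_curr < threshold * k_prev:   # same test as k_curr / k_prev < threshold
            return covered
        k_prev = k_curr

# Hypothetical toy graph: two 4-cliques joined by the single edge (3, 4).
edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2)) + [(3, 4)]
adj = {v: set() for v in range(8)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

from_inner_vertex = l_shell_community(adj, 0, 0.5)
from_border_vertex = l_shell_community(adj, 3, 0.5)
```

Starting from vertex 0 the sketch recovers the first clique, whereas starting from the border vertex 3 it covers the whole graph, which illustrates the sensitivity to the starting vertex mentioned above.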

The method reported in the present article also
involves the consideration of expanding neighborhoods and
completion of growth in terms of rate of expansion.
However, it differs from the method of Bagrow and Bollt
because it analyzes the connections of each vertex at the
border of the community individually instead of all
vertices at the same time. Besides, it considers not only the first
neighborhood of the community, but the second one too.
At each expansion from a starting vertex, edges can
be removed considering two trials based on the first and
second neighborhoods of the enclosing community.
Another difference is that our method uses a threshold only
at the second neighborhood, whose value is determined
so as to obtain the best value of the modularity, i.e.
the threshold is varied up to a maximum value and the
modularity is computed at each variation. The procedure
is similar to that used by Girvan and Newman,
where the modularity is calculated at each edge removal.

The next sections describe the suggested method as 
well as its application to community detection in real 
and in computer generated networks. A comparison with 
the Girvan-Newman method in terms of precision and 
execution time is also presented and discussed. 



II. HIERARCHICAL GROWTH METHOD 

A community is formed by a set of densely connected 
vertices which is sparsely connected with the remain- 
der of the network. The proposed hierarchical growth 
method finds communities by considering two expanding 
neighborhoods. The first neighborhood of a given vertex
is composed of those vertices at a distance of one edge
from that vertex. Similarly, the set of vertices at a distance
of two edges from that given vertex constitutes its
second neighborhood. Following this definition, two conditions
are checked in order to determine whether a given vertex $i$
located in the first neighborhood of a known community
belongs to this community:

1. First condition:

   $$k_{in_1}(i) > k_{out_1}(i), \qquad (1)$$

   where $k_{in_1}(i)$ is the number of links of the vertex $i$
   with vertices belonging to the community and with
   vertices in the first neighborhood, and $k_{out_1}(i)$ is the
   number of links between the vertex $i$ and vertices
   in the remainder of the network.

2. Second condition:

   $$k_{in_2}(i) > \alpha \, k_{out_2}(i), \qquad (2)$$

   where $k_{in_2}(i)$ is the number of links of the
   neighbors of $i$ located in the second community
   neighborhood with vertices belonging to the first
   neighborhood, and $k_{out_2}(i)$ is the number of links between
   the neighbors of $i$ and vertices in the remainder of
   the network. The parameter $\alpha$ varies from 1 to a
   threshold value which is determined according to
   the highest value of the modularity.

The first condition is sufficient to determine if a vertex
belongs to the community, but it is not necessary. The
coefficient $\alpha$ acts as a threshold ranging from one to a
maximum value. The extension of the current method
to weighted networks is straightforward.

The hierarchical growth starts from each vertex of the
network at each step, with the vertices with highest
clustering coefficient selected first because they are more
likely to be inside communities. So, the first and/or the
second conditions are analyzed at each step, while the
ring around the starting vertex grows, adding vertices
to the community or removing edges. Nodes satisfying
the first and/or the second conditions (equations 1 and 2)
are added to the community. Otherwise, their links with
the known community are removed. Figure 1 illustrates
a simple application example of the method. In order to
determine the best division of the network, the threshold
$\alpha$ is varied up to a maximum value and, at each
variation, the modularity $Q$ is computed. The modularity
is a measure of the quality of a particular division of
networks. If a particular network is to be split into $c$
communities, $Q$ is computed by defining a symmetric $c \times c$
matrix $E$ whose diagonal elements, $e_{ii}$, give the fraction
of connections between vertices in the same community and
whose remaining elements, $e_{ij}$, give the fraction of connections
between the communities $i$ and $j$,

$$Q = \sum_i \Big[ e_{ii} - \Big( \sum_j e_{ij} \Big)^2 \Big] = \mathrm{Tr}(E) - \lVert E^2 \rVert \qquad (3)$$

FIG. 1: Application example of the hierarchical growth. The
process is started at vertex 0. Its neighborhood, indicated
by black vertices, is analyzed next, and the vertices 1, 2, 4
and 5 are added to the community following the first condition
(equation 1). The vertex 3 is added to the community following
the second condition (equation 2 with $\alpha = 1$). The current
community neighborhood (gray vertices) is then checked, and
the vertices 6, 7 and 8 are added because of the first condition.
Next, the links between the community and the vertices 9 and
10 are removed, splitting the network into two communities.



Algorithm 1: The general algorithm for the hierarchical
growth method.

for each vertex of the network do
    put the next vertex with highest clustering coefficient value in $C$
    while $C$ does not stop growing do
        put the neighbors of $C$ in $R$
        for each vertex $i$ in $R$ do
            compute $k_{in_1}(i)$ and $k_{out_1}(i)$
            if $k_{in_1}(i) > k_{out_1}(i)$ then
                insert the vertex $i$ in $C$
            else
                select the neighbors of $R$ and put in $R_1$
                compute $k_{in_2}(i)$ and $k_{out_2}(i)$
                if $k_{in_2}(i) > \alpha \, k_{out_2}(i)$ then
                    insert the vertex $i$ in $C$
                else
                    remove the links between the vertex $i$ and the vertices in $C$
                end if
            end if
        end for
    end while
    Clean $C$, $R$ and $R_1$
end for



In equation 3, $\mathrm{Tr}(E)$ is the trace of the matrix $E$ and
$\lVert E^2 \rVert$ indicates the sum of the elements of the matrix $E^2$.
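
As a worked illustration of equation 3, the sketch below computes $Q$ for a given partition. It assumes, as in Newman and Girvan's definition, that the entries of $E$ are normalized as fractions of the total number of edges, with each inter-community edge split evenly between $e_{ij}$ and $e_{ji}$; the function name and the toy graph are hypothetical.

```python
from itertools import combinations

def modularity(edges, communities):
    """Q = Tr(E) - ||E^2|| for a partition given as a list of vertex sets."""
    label = {v: c for c, members in enumerate(communities) for v in members}
    n_c = len(communities)
    m = len(edges)
    E = [[0.0] * n_c for _ in range(n_c)]
    for u, v in edges:
        i, j = label[u], label[v]
        # Each edge contributes 1/m in total; an inner edge adds 1/m to e_ii,
        # a crossing edge adds 1/(2m) to both e_ij and e_ji.
        E[i][j] += 0.5 / m
        E[j][i] += 0.5 / m
    trace = sum(E[i][i] for i in range(n_c))
    row_sums = [sum(row) for row in E]
    # For a symmetric E, the sum of the elements of E^2 equals the sum of squared row sums.
    return trace - sum(a * a for a in row_sums)

# Hypothetical toy graph: two 4-cliques joined by the single edge (3, 4).
edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2)) + [(3, 4)]
q_split = modularity(edges, [{0, 1, 2, 3}, {4, 5, 6, 7}])
q_whole = modularity(edges, [set(range(8))])
```

For this graph the two-clique division gives $Q = 12/13 - 1/2$, while placing all vertices in a single community gives $Q = 0$.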

Thus, the splitting of the network considers the value 
of a that provides the highest value of the modularity. 
The pseudocode which describes the hierarchical growth 
method is given in Algorithm 1. 
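
A possible Python rendering of Algorithm 1 is sketched below for a fixed value of the threshold. Several details are assumptions made for illustration rather than statements of the authors' implementation: the graph is stored as adjacency sets, clustering coefficients are computed once on the original graph, seeds already placed in a community are skipped, border vertices are visited in sorted order, and "the remainder of the network" is taken to mean every vertex outside the community and its first neighborhood. The sweep over the threshold and the modularity computation are not included.

```python
from itertools import combinations

def clustering(adj, v):
    """Local clustering coefficient: fraction of neighbor pairs that are linked."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def hierarchical_growth(edges, alpha=1.0):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Seeds ordered by decreasing clustering coefficient (ties broken by label).
    seeds = sorted(adj, key=lambda v: (-clustering(adj, v), v))
    assigned, communities = set(), []
    for seed in seeds:
        if seed in assigned:                 # assumption: skip vertices already placed
            continue
        C, grew = {seed}, True
        while grew:
            grew = False
            R = {w for v in C for w in adj[v]} - C        # first neighborhood of C
            inner = C | R
            R1 = {w for v in R for w in adj[v]} - inner   # second neighborhood of C
            for i in sorted(R):
                k_in1 = len(adj[i] & inner)
                k_out1 = len(adj[i]) - k_in1
                if k_in1 > k_out1:                        # first condition
                    C.add(i)
                    grew = True
                    continue
                nbrs2 = adj[i] & R1
                k_in2 = sum(len(adj[j] & R) for j in nbrs2)
                k_out2 = sum(len(adj[j] - inner) for j in nbrs2)
                if k_in2 > alpha * k_out2:                # second condition
                    C.add(i)
                    grew = True
                else:                                     # cut i off from the community
                    for c in adj[i] & C:
                        adj[c].discard(i)
                        adj[i].discard(c)
        communities.append(C)
        assigned |= C
    return communities

# Hypothetical toy graph: two 4-cliques joined by the single edge (3, 4).
edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2)) + [(3, 4)]
found = hierarchical_growth(edges, alpha=1.0)
```

On the toy graph the growth from vertex 0 absorbs the first clique, removes the bridging edge, and the remaining vertices form the second community.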



III. APPLICATIONS 

In this section we illustrate applications of the hierarchical
growth method to particular problems while analyzing its
accuracy and performance. Its accuracy is determined by
comparing the obtained results with the expected divisions
of different networks. To assess performance, we compared
the hierarchical growth method with Girvan and Newman's
algorithm, whose implementation is based on the algorithm
developed by Brandes [29] for computing vertex
betweenness centrality.

In order to split the network into communities the 
Girvan-Newman algorithm proceeds as follows: 

1. Calculate the betweenness score for each of the 
edges. 

2. Remove the edge with the highest score. 

3. Compute the modularity for the network. 

4. Go back to step 1 until all edges of the network
are removed, resulting in $N$ non-connected nodes.

The best division is achieved when the highest modu- 
larity value is obtained. In this way, the Girvan-Newman 
method runs in two steps: (i) first all edges are removed 
from the network and the modularity value is computed 
at each removal, (ii) next, the highest value of modularity 
is determined and the corresponding edges removed. 
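
The four steps above can be sketched in plain Python as follows. This is an illustrative, unoptimized version and not the implementation used for the timings reported here: edge betweenness is obtained with a breadth-first accumulation in the style of Brandes' algorithm adapted to edges, modularity is always evaluated on the original edge list, and all function names and the toy graph are assumptions.

```python
from collections import deque
from itertools import combinations

def edge_betweenness(adj):
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, order = {s: 0}, []
        sigma = {v: 0.0 for v in adj}        # number of shortest paths from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):            # accumulate dependencies back toward s
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    return {e: b / 2.0 for e, b in bet.items()}   # each pair was counted from both ends

def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            for w in adj[stack.pop()]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps

def modularity(edges, comps):
    label = {v: c for c, comp in enumerate(comps) for v in comp}
    m = len(edges)
    inside, ends = [0.0] * len(comps), [0.0] * len(comps)
    for u, v in edges:
        ends[label[u]] += 0.5 / m
        ends[label[v]] += 0.5 / m
        if label[u] == label[v]:
            inside[label[u]] += 1.0 / m
    return sum(e - a * a for e, a in zip(inside, ends))

def girvan_newman(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    best = components(adj)
    best_q = modularity(edges, best)
    while any(adj.values()):                 # until every edge has been removed
        bet = edge_betweenness(adj)          # step 1
        u, v = max(bet, key=bet.get)         # step 2
        adj[u].discard(v)
        adj[v].discard(u)
        comps = components(adj)
        q = modularity(edges, comps)         # step 3
        if q > best_q:
            best, best_q = comps, q
    return best, best_q

# Hypothetical toy graph: two 4-cliques joined by the single edge (3, 4).
edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2)) + [(3, 4)]
best, best_q = girvan_newman(edges)
```

Recomputing the betweenness of every edge after each removal is what makes the procedure expensive on large networks.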

A. Computer generated networks 

A typical procedure to quantify how well a community
identification method performs adopts networks with
known community structure, called computer generated
networks, which are constructed by using two different
probabilities. Initially, a set of $n$ vertices is
classified into $c$ communities. At each subsequent step, two
vertices are selected and linked with probability $p_{in}$ if
they are in the same community, or $p_{out}$ if they
belong to different communities. The values of $p_{in}$ and
$p_{out}$ can be selected so as to control the sharpness of the
separation between the communities. When $p_{in} \gg p_{out}$,
the communities can easily be visualized. On the other
hand, when $p_{in} \to p_{out}$, it is difficult to distinguish the
communities and the methods used for community
identification lose precision in the correct classification of the
vertices into communities.
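
A generator of this kind can be sketched in a few lines. The function name, the equal-size assignment of vertices to communities and the `seed` argument are assumptions made for illustration; every pair of vertices is visited exactly once.

```python
import random

def planted_partition(n, c, p_in, p_out, seed=None):
    """Return (edges, label) for n vertices split into c equal-size communities."""
    rng = random.Random(seed)
    label = [v * c // n for v in range(n)]      # community index of each vertex
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if label[u] == label[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, label

# Limiting case: only intra-community links, so four disjoint cliques of 32 vertices.
edges, label = planted_partition(128, 4, 1.0, 0.0, seed=1)
```

With $p_{in} = 1$ and $p_{out} = 0$ the result is four disjoint cliques, the sharpest possible separation; raising $p_{out}$ toward $p_{in}$ blurs the communities.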

We generated networks with 128 vertices, divided into
four communities of 32 vertices each. The total average
vertex degree $k_{in} + k_{out}$ of the network was kept constant
and equal to 16. In this way, as the value of $k_{out}$ is
increased up to 8, the recognition of the network communities
becomes more difficult. The proposed community finding
algorithm was applied to each network configuration, and
the fraction of vertices classified correctly was calculated.
Figure 2 shows the sensitivity of the hierarchical
growth method compared with the results obtained by
using Girvan and Newman's method.

FIG. 3: Processing time versus the size of the network. The
hierarchical growth (HG) method runs faster than the
Girvan-Newman (GN) method. While the processing time of the
Girvan-Newman method scales as $N^{3.0 \pm 0.1}$, the time of the
hierarchical growth method scales as $N^{1.6 \pm 0.1}$. Each data point
is an average over 10 graphs.



FIG. 2: Fraction of correctly classified vertices in terms of the
number of inter-community edges $k_{out}$ for a network with 128
vertices considering $k_{in} + k_{out} = 16$. The Girvan-Newman
method is more precise than the hierarchical growth method
when $k_{out} > 5$. Each data point is an average over 100 graphs.



As Figure 2 shows, the algorithm performs with nearly full
accuracy when $k_{out} < 5$, classifying more than 90% of
the vertices correctly. For higher values, this fraction falls
off as the connections between communities get denser.
When $k_{out} > 5$, the Girvan-Newman method gives a
better result, so it tends to be more suitable for this kind
of network.

The execution times of both methods were compared
considering the computer generated cases for which the
hierarchical growth method provides exact results (i.e.
we used $k_{out} = 2$, 3 and 4). We considered network
sizes varying from $N = 128$ to $N = 1024$ and
kept the average degree $k_{in} + k_{out} = 16$. The
hierarchical growth method turned out to be faster than the
Girvan-Newman method, as shown in Figure 3. While the
Girvan-Newman processing time scales as $N^{3.0 \pm 0.1}$, the
time of the hierarchical growth method scales as $N^{1.6 \pm 0.1}$,
which suggests that the hierarchical growth method is
particularly suitable for large networks.
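
Scaling exponents of this kind are commonly estimated as the slope of a least-squares line in log-log coordinates. The sketch below shows that calculation only; the function name is hypothetical and the timing values are synthetic numbers generated from an exact power law for illustration, not the measurements behind Figure 3.

```python
import math

def fit_power_law_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic timings following t = 2e-6 * N^1.6 exactly (illustration only).
sizes = [128, 256, 512, 1024]
times = [2e-6 * n ** 1.6 for n in sizes]
exponent = fit_power_law_exponent(sizes, times)
```

With measured timings the points scatter around the line, and the spread of the fitted slope across repetitions gives the quoted uncertainty.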

The constant $\alpha$ considered in the algorithm is
determined in the following way. The algorithm runs for $\alpha$
varying from 1 to a maximum value, increasing in
steps of 0.5. For each value of $\alpha$, the communities are
computed, and the decomposition with the best value of
modularity is chosen. In our tests, the best value of $\alpha$
was always equal to 1 for all network sizes considered.



B. Zachary karate club network 

In order to apply the hierarchical growth method to a
real network, we used the popular Zachary karate club
network [30], which is considered a simple benchmark
for community finding methodologies. This
network was constructed with the data collected by
observing 34 members of a karate club over a period of 2 years
and considering friendship between members. The two
obtained communities are shown in Figure 4. This
partitioning of the network corresponds almost perfectly to
the actual division of the club members, with only one
vertex, i.e. vertex 3, misclassified. This result is
analogous to that obtained by using the Girvan-Newman
algorithm based on measuring betweenness centrality.



C. Image segmentation 

A third application of our method is related to the
important problem of image segmentation, i.e. the partition
of image elements (i.e. pixels) into meaningful areas
corresponding to existing objects. As described by Costa [31],
an image can be modeled as a network and methods
applied to network characterization can be used to
identify image properties. The application of a community
finding algorithm to image segmentation was proposed in
that same work. Since digital images are normally
represented in terms of matrices, where each element
corresponds to a pixel, it is possible to associate each pixel
to a node using a network image representation. The edge
weight between every pair of pixels can be determined by
the Euclidean distance between feature vectors composed
of visual properties (e.g. gray-level, color or texture) at
or around each pixel. Thus, considering the distance
between the feature vectors of every pair of pixels in the image,
this approach results in a fully-connected network, where
closer pixels are linked by edges with higher weights. To
eliminate weak links, a threshold can be adopted over
the weighted network, resulting in a simplified adjacency
matrix. The connections whose weight is smaller than
the threshold are set to zero, and the others to one.

FIG. 4: The Zachary karate club friendship network divided
into two communities, represented by circles and squares. The
division obtained by the hierarchical growth is the same as the
one provided by the Girvan-Newman method.

FIG. 5: The real image and its respective segmentation. The
image is transformed into a network and a threshold T = 0.25
is used to eliminate weak links.
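
The construction of the thresholded network from pixel features can be sketched as below. The text does not specify how distances are converted into weights, so the similarity `1 - d / d_max` used here is purely an assumption, as are the function name and the sample features; "weak links" are read as low-weight links, i.e. pairs of pixels that are far apart in feature space.

```python
import math

def image_to_adjacency(features, threshold):
    """Binary adjacency matrix linking pixels whose feature vectors are similar."""
    n = len(features)
    dist = [[math.dist(a, b) for b in features] for a in features]
    d_max = max(max(row) for row in dist) or 1.0   # avoid dividing by zero
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weight = 1.0 - dist[i][j] / d_max      # assumed similarity weight in [0, 1]
            if weight >= threshold:                # weak links are dropped
                adj[i][j] = adj[j][i] = 1
    return adj

# Three hypothetical pixels described by a single gray-level feature each.
features = [(0.0,), (0.1,), (1.0,)]
adj = image_to_adjacency(features, threshold=0.5)
```

Only the two pixels with similar gray levels remain linked. The full distance matrix grows with the square of the number of pixels, so this direct form is practical only for small images.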

The mapping between a pixel in the image and a node
in the network, as well as the reverse operation, is defined
by

$$i = y + (x - 1)M, \qquad (4)$$
$$x = \lfloor (i - 1)/M \rfloor + 1, \qquad (5)$$
$$y = \mathrm{mod}(i - 1, M) + 1, \qquad (6)$$

where $M$ is the size of the square image, and
$1 \le x, y \le M$ are the pixel positions in the image.
In this way, the resulting weighted network has $N = M^2$
nodes and $n = N(N - 1)/2$ edges.
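
Equations 4 to 6 translate directly into code, which also makes it easy to check that the two directions of the mapping are inverse to each other. The function names are hypothetical; indices are 1-based as in the equations.

```python
def pixel_to_node(x, y, M):
    """Equation 4: node index of the pixel at position (x, y) in an M x M image."""
    return y + (x - 1) * M

def node_to_pixel(i, M):
    """Equations 5 and 6: pixel position (x, y) of node i."""
    x = (i - 1) // M + 1
    y = (i - 1) % M + 1
    return x, y

M = 4
# Map every pixel to its node and back again.
round_trip = all(
    node_to_pixel(pixel_to_node(x, y, M), M) == (x, y)
    for x in range(1, M + 1)
    for y in range(1, M + 1)
)
n_nodes = M * M
n_edges = n_nodes * (n_nodes - 1) // 2
```

For a 4 x 4 image this gives 16 nodes and 120 edges in the fully-connected network, before any thresholding.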

Figure 5 shows the initial image and its respective
segmentation. The results obtained by the hierarchical
growth method and by using the Girvan-Newman
method are similar. Since the network typically obtained
for images can be substantially large ($N = M^2$),
a faster method for community identification is necessary
for practical applications, a demand potentially met by
the hierarchical growth method.



IV. CONCLUSIONS 

In this paper we have proposed a new method to
identify communities in networks. The method is based on a
hierarchical growth from a starting node while its
neighborhood is analyzed, and edges are removed according to two
rules based on the first and/or second neighborhoods of
the growing community. We have applied this method to
computer generated networks in order to determine its
precision and performance, comparing it with the popular
method based on edge betweenness centrality proposed
by Girvan and Newman [18]. Despite being
less precise than the Girvan-Newman method, the proposed
algorithm is promisingly fast for determining communities.
We have also applied the hierarchical growth
method to the Zachary karate club network and to image
segmentation. In both cases, the results are
similar to those obtained by the Girvan-Newman algorithm.

As discussed by Danon et al., the most accurate
methods tend to be computationally more expensive.
The method presented in this article cannot provide as
good a precision as most of the methods, but it offers
competitive speed. As a matter of fact, both performance and
accuracy need to be considered when choosing a method
for practical purposes. In the case of image segmentation,
the suggested method is particularly
suitable given the large size of the typical networks
(increasing with the square of the image size, $N = M^2$) and
the sharp modular structure often found in images.

As future work, the algorithm proposed here can be
improved by considering other conditions to include nodes in
the growing community, for example higher levels of
community neighborhood. Besides, local modularity
can also be considered in order to obtain a
more precise partition of the network.